Can There Be Reliability without "Reliability?"


Abstract 

A  recent  article  by  Pamela  Moss  asks  the  title  question,  "Can  there  be 
validity  without  reliability?"  If  by  reliability  we  mean  only  KR-20  coefficients 
or  inter-rater  correlations,  the  answer  is  yes.  Sometimes  these  particular 
indices  for  evaluating  evidence  suit  the  problem  we  encounter;  sometimes 
they  don't.  If  by  reliability  we  mean  credibility  of  evidence,  where  credibility 
is  defined  as  appropriate  to  the  intended  inference,  the  answer  is  no,  we 
cannot  have  validity  without  reliability.  Because  "validity"  encompasses  the 
process  of  reasoning  as  well  as  the  data,  uncritically  accepting  observations  as 
strong  evidence,  when  they  may  be  incorrect,  misleading,  unrepresentative, 
or  fraudulent,  may  lead  coincidentally  to  correct  conclusions  but  not  to  valid 
ones.  This  paper  discusses  and  illustrates  a  broader  conception  of  "reliability" 
in  educational  assessment,  to  ground  a  deeper  understanding  of  the  issues 
raised  by  Professor  Moss's  question. 

Key  words:  Educational  assessment,  hermeneutics,  reliability,  validity. 


Introduction 


"Can  there  be  validity  without  reliability?"  asks  Pamela  Moss  (1994)  in 
her  article  of  the  same  name  in  the  Educational  Researcher.  Yes,  Professor 
Moss  answers.  She  proposes  a  hermeneutic  approach  to  educational 
assessment — the  validity  of  which,  she  argues,  does  not  depend  on  standard 
test  theory  indicators  of  reliability  such  as  KR-20  coefficients  and  inter-rater 
correlations. I agree that it is possible to have validity without reliability, if by
"reliability"  we  refer  only  to  these  particular  indices  and  others  like  them,  but 
this  is  far  too  narrow  a  conception  of  "reliability."  More  broadly  construed, 
however,  reliability  concerns  the  credibility  and  the  limitations  of  the 
information  from  which  we  wish  to  draw  inferences.  If  we  fail  to  address  this 
concern  in  an  appropriate  manner,  we  fail  to  establish  the  validity  of  those 
inferences.  This  paper  discusses  and  illustrates  a  broader  conception  of 
"reliability"  in  educational  assessment,  to  ground  a  deeper  understanding  of 
the  issues  raised  by  Professor  Moss's  question.  (See  Mislevy,  1994,  for  a  more 
comprehensive  discussion.) 

That  reliability  be  examined  "in  an  appropriate  manner"  is  key,  because 
KR-20s  and  inter-rater  correlations  characterize  the  credibility  of  certain  kinds 
of  data  we  employ  for  certain  kinds  of  inferences  in  educational  assessment, 
but  not  others.  I  applaud  Professor  Moss's  use  of  a  hermeneutic  perspective 
to  gain  insights  into  questions  of  educational  assessment,  because,  as  Goethe 
wrote in Sprüche in Prosa, "He who is ignorant of foreign languages knows
not  his  own."  Exploring  how  other  fields  deal  with  evidence  and  inference 
can  indeed  help  us  disentangle  the  commingled  concepts  from  statistics, 
psychology,  and  measurement  that  constitute  test  theory  as  we  usually  think 
about it (to distinguish how we are reasoning from what we are reasoning
about) to better prepare ourselves to tackle problems of how to characterize
students'  learning  beyond  scores  on  standardized  tests,  how  to  evoke  and 
interpret  evidence  to  this  end,  and  how  to  establish  the  weight  and  coverage 
of  data  as  evidence  for  conjectures  and  decisions  framed  in  these  terms. 
Physical  measurement  has  long  been  a  source  of  concepts  and  techniques  for 
educational  assessment  (e.g.,  Rasch,  1960/1980).  In  addition  to  the 
hermeneutic  tradition,  we  can  also  gain  insights  from  fields  such  as  medicine, 
history, and jurisprudence¹ (Schum, 1987). Seeing how "reliability problems"
arise  and  how  they  are  dealt  with  in  these  fields  helps  us  understand  their 
appearance  in  our  own. 

What  is  Reliability? 

We  can  think  of  increasingly  general  senses  of  the  term  "reliability"  as 
it  relates  to  educational  assessment: 

True-score reliability (Gulliksen, 1950). The classical reliability
coefficient rho assumes repeatable observations composed of an examinee's
"true score" and a random "measurement error." Rho is the proportion of
variance in a particular population of examinees' observed scores attributable
to the variance of their true scores. The data are equally valued responses to
interchangeable tasks, constituting a source of potentially corroborating
evidence, that is, more evidence of the same kind about a given inference. Rho
does in fact gauge observed scores' weight of evidence, specifically for the
inference of lining up people from this particular population along the
true-score scale. It does in fact bound "validity," specifically for the
inference of predicting a variable related linearly to true score, with
"validity" defined as the correlation between the scores and predicted
variables in this particular population. But even under classical test theory,
rho need not convey the evidential value of scores for other inferences, such
as the magnitude of change in true score from pretest to posttest, or whether
a student's true score is above a specified cutoff value.

1 In Analysis of Evidence, Anderson and Twining (1991) use analogies from educational testing
to help law students learn distinctions among rules, criteria, and standards for evaluating evidence.
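As a concrete instance of the classical coefficient, the sketch below computes KR-20, one of the indices named in the Introduction, for dichotomously scored items, together with the ceiling that a reliability of that size places on the correlation with any other variable under classical test theory. The function name `kr20` and the small response matrix are illustrative assumptions, not data from any study cited here.

```python
import math


def kr20(responses):
    """Estimate KR-20 from a list of examinee rows of 0/1 item scores.

    Population variances (divide by n) are used throughout so that the
    item variances p*(1-p) and the total-score variance are comparable.
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two examinees and two items")
    if any(len(row) != k for row in responses):
        raise ValueError("all rows must have the same number of items")

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")

    # Sum of item variances p*q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: four examinees, three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
rho = kr20(scores)
# Under classical test theory, a correlation with any other variable
# cannot exceed the square root of the reliability.
validity_ceiling = math.sqrt(rho)
print(f"KR-20 = {rho:.2f}, validity ceiling = {validity_ceiling:.2f}")
```

Neither number says how precisely a single examinee's score is located relative to a cutoff, or how well a pretest-to-posttest change is estimated, which is the limitation noted above.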

Reproducibility.  We  can  extend  "reliability"  beyond  this  specific  and 
population-bound  inference,  yet  retain  grounding  in  the  consistency  of 
exchangeable  (equally-informative  and  equally-valued)  independent  sources. 
Experimenters attempting to reproduce Pons and Fleischmann's purported
cold  fusion  results  faced  reliability  concerns  in  this  sense:  "The  way  to 
circumvent  this  skittishness  [of  BF3  neutron  counters]  was  to  use  two 
counters,  or  even  five  or  six,  and  only  pay  attention  to  those  events  in  which 
all  the  detectors  fired  simultaneously"  (Taubes,  1993,  p.  450).  Before  detailing 
investigations  of  witnesses  to  the  Kennedy  assassination,  Gerald  Posner 
(1993,  p.  236)  summarized  an  overarching  pattern: 

How many shots were fired at Dealey Plaza? ... Estimates at the scene
ranged  from  one  to  eight.  However,  on  this  issue,  there  was  more 
agreement  than  on  any  other  postassassination  matter.  Of  the  nearly 
two  hundred  witnesses  ...  over  88  percent  heard  three  shots.  ... 
Although almost every conspiracy theory that proposes more than one
assassin relies on there having been four or more shots, the writers
seldom disclose that fewer than one in twenty witnesses heard that
many.

In educational measurement, proportions of agreement among raters,
decision-consistency coefficients, and generalizability coefficients (Cronbach et
al., 1972) reflect this sense of reliability. These indices characterize the weight
of  evidence  for  inferences  within  the  true-score  test  theory  paradigm  that  are 
not  addressed  by  rho.  They  can  be  useful  even  if  one  doesn't  literally  believe 
observations  are  exchangeable.  The  "Concentration"  section  of  each 
Advanced  Placement  Studio  Art  portfolio  is  rated  by  two  judges 
independently,  and  only  portfolios  that  provoke  excessive  differences  are 
probed  further.  If  ensuing  discussion  reveals  one  judge  differed  because  of 
special  knowledge  of,  say,  glazing  techniques,  this  information  impacts  the 
deliberation.  The  exchangeability  framework  provides  indices  of  similarity 
among  judges'  evaluations,  but  just  as  importantly,  it  highlights  particulars 
where  exchangeability  is  not  a  plausible  approximation,  to  direct  attention 
and  expertise  where  they  are  most  needed  (Myford  &  Mislevy,  in  press). 
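The double-rating procedure just described can be sketched in a few lines: an agreement index summarizes how far the judges behave as exchangeable sources, and a discrepancy rule picks out the portfolios where they do not. The rating scale, the `max_gap` threshold, and the function name are hypothetical; the text does not specify the actual Advanced Placement rule.

```python
def review_double_ratings(judge_a, judge_b, max_gap=1):
    """Summarize two independent ratings per portfolio.

    Returns the proportion of exact agreement and the indices of
    portfolios whose ratings differ by more than max_gap.
    """
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty rating lists of equal length")

    pairs = list(zip(judge_a, judge_b))
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    flagged = [i for i, (a, b) in enumerate(pairs) if abs(a - b) > max_gap]
    return exact, flagged


# Hypothetical ratings on a 1-4 scale for six portfolios.
ratings_a = [3, 2, 4, 1, 3, 2]
ratings_b = [3, 3, 4, 4, 2, 2]
agreement, to_discuss = review_double_ratings(ratings_a, ratings_b)
print(agreement, to_discuss)
```

The summary index and the flagged list serve different purposes: the first characterizes the rating process as a whole, while the second directs judges' discussion to the particular portfolios where treating them as interchangeable is least plausible.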

Differential  likelihood.  "A  datum  becomes  evidence  in  some  analytic 
problem  when  its  relevance  to  one  or  more  hypotheses  being  considered  is 
established. ... [E]vidence is relevant on some hypothesis if it either increases
or  decreases  the  likeliness  of  the  hypothesis"  (Schum,  1987,  p.  16).  Under 
probability-based  reasoning,  the  relative  likelihood  of  an  observation  under 
alternative "true states" is the weight of evidence it provides for each;
"reliable" observations make sharp distinctions among the possibilities.² The
empirical consistency discussed above is one way to ground likelihoods; we
take a BF3 burst with a grain of salt once we know a bump is as likely to cause
one as an actual neutron. Theoretical and subjective considerations can also
provide information about relative likelihoods. The MUNIN neuromuscular
disease diagnostic system uses conditional probabilities for test
results and symptoms given disease states, which are based on clinical
experience and physiological theory (Andreassen, Jensen, & Olesen, 1990).
Failure Analysis Associates' "probability cones" (Figure 1) for sources of shots
in the Kennedy assassination extend uncertainties in positions and angles
backwards from the points of impact (Posner, 1993, p. 476).

2 Agreeing too much on key points, along with agreeing too little on tangential issues, lowers
the credibility of suspected collaborators in a criminal investigation; this pattern is likely
under the hypothesis of a rehearsed alibi. Reproducibility does not equal credibility, since a
conspirator can repeat a lie 100 times. Data "too good to be true" toppled Cyril Burt (Kamin,
1974).

[Figure 1]

We use similar reasoning to convey our uncertainty about a student's
proficiency under an item response theory model, or her stage of proportional
reasoning under a latent class model. We obtain in these cases numerical
assessments of the evidential value (read "reliability") of the data, but only
if, perhaps after considerable effort, we can arrange circumstances in which
our data, our model, and our intentions cohere (Wright & Stone, 1979). Less
formally, a tutor constructs a model for a student's understanding, probing
"What organization does the student have in mind so that his actions seem,
to him, to form a coherent pattern?" (Thompson, 1982). "Reliable" data allow
the tutor to identify a perspective from which the student's pattern of actions
makes sense, but is unlikely from relevant alternative perspectives.
This is not "reliability" in the sense of accumulating corroborating
evidence, as in classical test theory, but in the sense of converging evidence:
accumulating evidence of different types that support the same inference. A
mass of data is more reliable in this sense as more aspects support a given
inference and fewer aspects conflict with or contradict it; it is less reliable if it
is internally inconsistent or equivocal, or if we realize that securing
additional information would cause us to revise our beliefs substantially.

Such  considerations  characterize  the  reliability  of  the  evidence  supporting  a 
legal  case,  and  jurists  and  statisticians  have  explored  the  means  by  which,  and 
the extent to which, they can be expressed in terms of differential likelihoods
(e.g.,  Kadane  &  Schum,  1992). 
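A minimal sketch of what "differential likelihood" means computationally may help. It applies Bayes' rule to two hypotheses and contrasts an observation that is equally likely under both (the neutron counter that a bump sets off as readily as a neutron) with one that discriminates sharply. All probabilities, the hypothesis labels, and the function name `posterior` are assumptions chosen for illustration.

```python
def posterior(prior, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses.

    prior and likelihoods are dicts keyed by hypothesis name;
    likelihoods[h] is P(observation | h).
    """
    joint = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: joint[h] / total for h in joint}


prior = {"neutron": 0.5, "no_neutron": 0.5}

# A burst that a bump produces as readily as a neutron: likelihood ratio 1.
skittish = posterior(prior, {"neutron": 0.2, "no_neutron": 0.2})

# A hypothetical detector that makes a sharp distinction: likelihood ratio 9.
sharp = posterior(prior, {"neutron": 0.9, "no_neutron": 0.1})

print(skittish["neutron"], sharp["neutron"])
```

With a likelihood ratio of 1 the observation leaves belief where it started, however often it is repeated; with a ratio of 9 the same prior moves to roughly 0.9. In this sense the "reliability" of a datum is a property of the observation relative to the hypotheses at issue, not of the observation alone.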

Credibility.  In  common  parlance,  reliability  simply  means  the  extent  to 
which  information  can  be  trusted,  a  concern  clearly  broader  than  traditional 
educational  measurement  situations.  The  world  constantly  confronts  us  with 
unrepeatable  observations  and  non-exchangeable  sources,  which  we  must 
interpret  as  best  we  can  if  we  have  no  alternative  (there  was  only  one  trial  of 
the  Kennedy  assassination),  or  learn  from  to  develop  more  principled  ways  of 
gathering  and  interpreting  information  (to  assess  prospects  of  cold  fusion  or 
students'  understandings  of  proportional  reasoning). 

When  sources  are  not  exchangeable,  we  must  unravel  secondary 
sources  of  information  about  their  credibilities: 

•  Not  all  cold  fusion  experiments  are  created  equal;  those  with  better 
controls  and  more  reliable  measuring  instruments,  or  incorporating 
lessons  from  earlier  experiments,  are  privileged.  Early  positive  results 
were  traced  to  experimental  mistakes  and  interpretational  errors,  in 
which questionable data were consistently accepted as evidence of
desired outcomes (Taubes, 1993).³

•  Lincoln  at  Gettysburg  (Wills,  1992)  is  mainly  a  hermeneutic  analysis  of 
what  Lincoln  meant  when  he  presented  the  Gettysburg  Address,  but  its 
Appendix  I  explores  what  he  actually  said.  Five  versions  in  Lincoln's 
hand  and  four  newspaper  transcriptions  survive.  Unanimity  about  a 
phrase  suggests  he  spoke  it  as  such,  but  for  discrepancies  Wills  must 
consider  such  clues  as  these:  The  draft  Lincoln's  secretary  claimed  he 
saw  Lincoln  speak  from  appears  on  Executive  Mansion  letterhead, 
corroborating  eyewitness  accounts,  but  omits  key  phrases  all 
newspapers  report  and  garbles  the  transition  between  pages. 

We  must  often  integrate  multiple  strands  of  evidence,  and  "reliability" 
typically  refers  to  the  weight  of  evidence  of  a  particular  strand.  Influence 
diagrams  in  troubleshooting  (e.g.,  Klempner  et  al.,  1991),  medical  diagnosis 
(Andreassen et al., 1990), and legal reasoning (Wigmore, 1937) depict how
sources  and  credibilities  of  information  relate  to  inferences.  Temperature  is 
one  strand  of  evidence  in  determining  whether  a  child's  infection  is  bacterial 
or  viral  (Figure  2).  A  thermometer  reading  is  direct  evidence  about 
temperature,  and  the  "reliability"  of  the  thermometer  concerns  its  credibility 
about this symptom. The reading is indirect evidence about the nature of the illness.
In conjunction with other evidence, even a hand on his forehead (an
"unreliable thermometer") can aid in the diagnosis.

3 A joke made the rounds of experimental labs: "Q: Why can't most people get heat, neutrons,
and tritium [putative evidence of cold fusion] at the same time? A: It's almost impossible to
make that many mistakes at once" (Taubes, 1993, p. 468).

[Figure  2] 
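The chain of reasoning in the thermometer example, from illness to temperature to reading, can be sketched as a two-step probability calculation. The reading bears on the illness only through the unobserved temperature, so the instrument's credibility limits how far the reading can move the diagnosis. Every number below is hypothetical, as are the function and variable names; none is taken from Figure 2.

```python
def posterior_bacterial(prior_bacterial, p_fever, p_says_fever):
    """P(bacterial | the instrument reports a fever).

    p_fever[illness] is P(fever | illness); p_says_fever[state] is
    P(instrument reports fever | true fever state).
    """
    def p_report(illness):
        # Marginalize over the unobserved true temperature state.
        pf = p_fever[illness]
        return pf * p_says_fever[True] + (1 - pf) * p_says_fever[False]

    joint_b = prior_bacterial * p_report("bacterial")
    joint_v = (1 - prior_bacterial) * p_report("viral")
    if joint_b + joint_v == 0:
        raise ValueError("a fever report is impossible under both illnesses")
    return joint_b / (joint_b + joint_v)


# All numbers are hypothetical, chosen only to illustrate the structure.
p_fever = {"bacterial": 0.8, "viral": 0.4}
thermometer = {True: 0.95, False: 0.05}   # rarely wrong about temperature
hand = {True: 0.70, False: 0.30}          # an "unreliable thermometer"

by_thermometer = posterior_bacterial(0.5, p_fever, thermometer)
by_hand = posterior_bacterial(0.5, p_fever, hand)
print(round(by_thermometer, 3), round(by_hand, 3))
```

Under these assumed values the hand on the forehead shifts belief less than the thermometer does, but it still shifts it in the same direction, which is why a weak strand of evidence can contribute when combined with others.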

In  this  light,  the  irony  is  not  that  test  administrators  warn  test  users 
against  interpreting  scores  without  other  sources  of  information,  but  that  the 
test  users  themselves  are  most  prone  to  reify  "traits"  such  as  "IQ"  or  "writing 
ability."  The  view  among  contemporary  researchers,  whose  work  is 
beginning  to  influence  the  next  generation  of  tests,  substantiates  the  caveat: 

The  evidence  from  cognitive  psychology  suggests  that  test 
performances  are  comprised  of  complex  assemblies  of  component 
information-processing  actions  that  are  adapted  to  task  requirements 
during  performance...  Whatever  their  practical  value  as  summaries, 
for  selection,  classification,  certification,  or  program  evaluation,  the 
cognitive  psychological  view  is  that  such  [trait-based]  interpretations  no 
longer  suffice  as  scientific  explanations  of  aptitude  and  achievement 
constructs.  (Snow  &  Lohman,  1989,  p.  317). 

Conclusion 

Can  we  have  validity  without  reliability?  If  by  reliability  we  mean  only 
KR-20  coefficients  or  inter-rater  correlations,  the  answer  is  yes.  Sometimes 
these  particular  indices  for  evaluating  evidence  suit  the  problem  we 
encounter;  sometimes  they  don't.  But  when  multiple  sources  of  evidence  are 
available  and  they  don't  agree,  we'd  better  have  alternative  lines  of 
argumentation  to  establish  the  weight  and  relevance  of  the  evidence  to  the 
inference  being  drawn.  Sometimes  people  disagree  because  they  focus  on 
different  aspects  of  a  situation  from  different  perspectives,  which  need  to  be 
integrated  in  a  more  thoughtful  way  than  averaging.  But  sometimes  people 
disagree  because  they  are  uninformed  or  biased,  because  their  task  is  not 
clearly specified, or because they are dishonest. We bear the burden of
unraveling  these  possibilities. 

If  by  reliability  we  mean  credibility  of  evidence,  where  credibility  is 
defined  as  appropriate  to  the  inference,  the  answer  is  no,  we  cannot  have 
validity  without  reliability.  Because  "validity"  encompasses  the  process  of 
reasoning  as  well  as  the  data,  uncritically  accepting  observations  as  strong 
evidence,  when  they  may  be  incorrect,  misleading,  unrepresentative,  or 
fraudulent,  may  lead  coincidentally  to  correct  conclusions  but  not  to  valid 
ones.  Good  intentions  and  plausible  theories  are  not  enough  to  honestly 
evaluate  and  subsequently  improve  our  efforts.  That  familiar  tools  for 
establishing  the  credibility  of  evidence  in  educational  assessment  do  not  span 
the  full  range  of  inferences  does  not  negate  the  responsibility  to  establish  the 
credibility  of  evidence  upon  which  educational  decisions  are  made.  If 
anything,  our  task  becomes  harder  rather  than  easier. 